ABSTRACT

In this work, we study the correlation between attribute sets 
and the occurrence of dense subgraphs in large attributed 
graphs, a task we call structural correlation pattern min- 
ing. A structural correlation pattern is a dense subgraph 
induced by a particular attribute set. Existing methods are 
not able to extract relevant knowledge regarding how vertex 
attributes interact with dense subgraphs. Structural corre- 
lation pattern mining combines aspects of frequent itemset 
and quasi-clique mining problems. We propose statistical 
significance measures that compare the structural correla- 
tion of attribute sets against their expected values using null 
models. Moreover, we evaluate the interestingness of struc- 
tural correlation patterns in terms of size and density. An 
efficient algorithm that combines search and pruning strate- 
gies in the identification of the most relevant structural cor- 
relation patterns is presented. We apply our method for 
the analysis of three real-world attributed graphs: a collab- 
oration, a music, and a citation network, verifying that it 
provides valuable knowledge in a feasible time. 

1. INTRODUCTION

In several real-life graphs, attributes can be associated 
with vertices in order to represent vertex properties. In so- 
cial networks, for example, vertex attributes are useful to 
model personal characteristics. Moreover, vertex attributes 
can be associated with content (e.g., keywords, tags) in the 
web graph. Such an extended graph representation, which is 
called an attributed graph, may support graph patterns that 
provide relevant knowledge in various application scenarios. 

An interesting question related to attributed graphs is 
how particular attributes are associated with the topology 
of real graphs. In other words, do there exist patterns that 
explain how vertex attributes interact with the graph struc- 
ture? How can we extract and evaluate such patterns? In 
this paper, we study the problem of correlating attribute sets 
with an important topological property of graphs, which is 
the organization of vertices into dense subgraphs. For instance,
we aim to address questions such as: How does a
particular set of interests induce communities in a social 
network? What are the communities that emerge around 
such interests? Such questions are related to important so- 
cial phenomena such as homophily [11] and influence [2]. 
Although several definitions of dense subgraphs have been 
proposed in the literature, most of them do not take vertex 
attributes into consideration. Furthermore, such definitions 
do not provide any knowledge regarding how different sets 
of attributes induce dense subgraphs. 

This work studies the correlation between vertex attributes 
and dense subgraphs, a task we call structural correlation 
pattern mining. The structural correlation of an attribute
set is the probability that a vertex is a member of a dense
subgraph in its induced graph. Moreover, a structural correlation
pattern is a dense subgraph induced by a particular
attribute set. Figure 1 illustrates a dataset for structural 
correlation pattern mining. The vertex attributes are given 
in Figure 1(a) and the graph is shown in Figure 1(b). Ex- 
ample dense subgraphs are shown in Figures 1(c) and 1(d). 
The structural correlation of the attribute A is 0.82, since 
9 out of 11 vertices are covered by dense subgraphs in its 
induced graph. On the other hand, the structural correla- 
tion of C is 0, because there is no dense subgraph inside the 
graph induced by C. The structural correlation of {A,B} 
is 1, due to the fact that every vertex is a member of a 
dense subgraph in the graph induced by {A,B}. The pair 
({A,B}, {6,7,8,9,10,11}) is an example of a structural cor- 
relation pattern, for which the subgraph is shown in Figure 
1(d). Another example is the pattern ({A}, {3,4,5,6}), for
which the induced subgraph is shown in Figure 1(c).

The structural correlation of attribute sets and the structural
correlation patterns provide complementary information:
while the former measures the correlation between a given
attribute set and the occurrence of dense subgraphs, the latter
provides representatives for such a correlation through
specific subgraphs. We formulate the structural correlation
pattern mining in terms of two existing data mining prob- 
lems: frequent itemset and quasi-clique mining. Frequent 
itemset mining [1, 19] is applied to handle the possibly large
number of attribute sets from the graph and quasi-cliques 
[14, 10] are used as a definition for dense subgraphs. 

We study structural correlation pattern mining focusing 
on two important aspects. The first aspect is the significance 
of the patterns. More specifically, it is relevant to provide 
significance measures for the structural correlation of at- 
tribute sets and the structural correlation patterns. The 
[Figure 1: Structural correlation pattern mining (illustrative example). Panels: (a) vertex attributes, (b) graph, (c) dense subgraph, (d) dense subgraph.]



second aspect is related to the computational cost of the 
proposed task. Our objective is to enable the analysis of 
large real graphs in a feasible time. Although significance 
and high performance are not necessarily concordant goals,
we propose significance metrics that may lead to efficient 
pruning strategies for structural correlation pattern mining. 

Regarding the significance of patterns, we formulate nor- 
malization approaches for structural correlation pattern min- 
ing in order to measure the statistical significance of the 
structural correlation of a given attribute set. The idea is 
to compare the structural correlation against its expected 
value, which is provided by a null model. Moreover, we 
evaluate the structural correlation patterns in terms of size 
(i.e., number of vertices) and density (i.e., cohesion). Such 
evaluation is useful to rank the most interesting patterns. 

We combine the statistical significance of the structural 
correlation of attribute sets and the size and density of struc- 
tural correlation patterns with effective constraints to prune 
down the search space. Moreover, we propose two strategies 
for computing the structural correlation of attribute sets effi- 
ciently. These pruning and search techniques are integrated 
into the SCPM (Structural Correlation Pattern Mining) al- 
gorithm, which is described and evaluated in this paper. In 
particular, we apply SCPM to the analysis of three real at- 
tributed graphs: collaboration, music and citation networks. 
The results show that SCPM is able to extract relevant 
knowledge regarding how vertex attributes are correlated 
with dense subgraphs in large attributed graphs. 

2. STRUCTURAL CORRELATION 
PATTERN MINING 

2.1 Definitions 

2.1.1 Structural Correlation 

We define an attributed graph as a 4-tuple $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{A}, \mathcal{F})$,
where $\mathcal{V}$ is the set of vertices, $\mathcal{E}$ is the set of edges,
$\mathcal{A} = \{a_1, a_2, \ldots, a_n\}$ is the set of attributes, and
$\mathcal{F} : \mathcal{V} \to \mathcal{P}(\mathcal{A})$ is a function that returns the set of
attributes of a vertex. $\mathcal{P}$ is the power set function. Each vertex
$v_i$ in $\mathcal{V}$ has a set of attributes $\mathcal{F}(v_i) = \{a_{i1}, a_{i2}, \ldots, a_{ip}\}$,
where $p = |\mathcal{F}(v_i)|$ and $\mathcal{F}(v_i) \subseteq \mathcal{A}$. Figure 1(b) shows an
example of an attributed graph where the vertex attributes are given
in Figure 1(a).

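Given the set of attributes $\mathcal{A}$, we define an attribute set $S$
as a subset of $\mathcal{A}$ ($S \subseteq \mathcal{A}$). Moreover, we denote by
$\mathcal{V}(S) \subseteq \mathcal{V}$ the vertex set induced by $S$ (i.e.,
$\mathcal{V}(S) = \{v_i \in \mathcal{V} \mid S \subseteq \mathcal{F}(v_i)\}$) and by
$\mathcal{E}(S) \subseteq \mathcal{E}$ the edge set induced by $S$ (i.e.,
$\mathcal{E}(S) = \{(v_i, v_j) \in \mathcal{E} \mid v_i, v_j \in \mathcal{V}(S)\}$). The graph
$\mathcal{G}(S)$, induced by $S$, is the pair $(\mathcal{V}(S), \mathcal{E}(S))$. We also
define a support function $\sigma$, which gives the number of occurrences of
an attribute set in the input graph ($\sigma(S) = |\mathcal{V}(S)|$), i.e.,
the number of vertices that contain $S$.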
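
The definitions of the induced vertex set, the induced edge set, and the support map directly onto set operations. The following Python sketch illustrates them on a small hypothetical graph (it is not the graph of Figure 1). The dictionary-based representation and the function names are assumptions made only for illustration.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[int, int]


def induced_vertices(attrs: Dict[int, Set[str]], S: Set[str]) -> Set[int]:
    """V(S): vertices whose attribute set contains every attribute of S."""
    return {v for v, a in attrs.items() if S <= a}


def induced_edges(edges: Set[Edge], vertices: Set[int]) -> Set[Edge]:
    """E(S): edges whose two endpoints both belong to V(S)."""
    return {(u, v) for (u, v) in edges if u in vertices and v in vertices}


def support(attrs: Dict[int, Set[str]], S: Set[str]) -> int:
    """sigma(S) = |V(S)|."""
    return len(induced_vertices(attrs, S))


# Hypothetical toy data: vertex -> attributes, plus an undirected edge list.
attrs = {1: {"A"}, 2: {"A", "B"}, 3: {"A", "B"}, 4: {"B", "C"}, 5: {"A", "B"}}
edges = {(1, 2), (2, 3), (3, 4), (2, 5), (3, 5)}

v_ab = induced_vertices(attrs, {"A", "B"})
e_ab = induced_edges(edges, v_ab)
```

Here `v_ab` and `e_ab` together play the role of the induced graph for the attribute set {A, B}.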

The structural correlation function measures the corre- 
lation between a given attribute set and the occurrence of 
dense subgraphs in an attributed graph. We apply quasi- 
cliques as a definition for dense subgraphs. Quasi-cliques 
are a natural extension of the traditional clique definition. 

DEFINITION 1. (Quasi-clique) Given a minimum density
threshold $\gamma_{min}$ ($0 < \gamma_{min} \leq 1$) and a minimum size
threshold $min\_size$, a quasi-clique is a maximal vertex set
$Q$ such that for each $v \in Q$, the degree of $v$ in $Q$ is at least
$\lceil \gamma_{min} \cdot (|Q| - 1) \rceil$ and $|Q| \geq min\_size$.

Figures 1(c) and 1(d) are examples of a 1-quasi-clique of
size 4 and a 0.6-quasi-clique of size 6, respectively, from the 
graph shown in Figure 1(b). The quasi-clique mining prob- 
lem consists of identifying the quasi-cliques from a graph 
considering minimum size and density parameters, a prob- 
lem known to be #P-hard [14, 17]. 
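
The degree and size conditions of Definition 1 can be checked for a given vertex set as in the sketch below. It only tests the two constraints and does not test maximality. The adjacency-dictionary representation, the function name, and the small tolerance used before rounding up are assumptions of this illustration.

```python
import math
from typing import Dict, Iterable, Set


def satisfies_quasi_clique(adj: Dict[int, Set[int]], Q: Iterable[int],
                           gamma_min: float, min_size: int) -> bool:
    """Check the size and minimum-degree constraints of Definition 1 for Q."""
    Q = set(Q)
    if len(Q) < min_size:
        return False
    # Required degree inside Q; the tolerance avoids rounding up because of
    # floating-point noise in the product.
    need = math.ceil(gamma_min * (len(Q) - 1) - 1e-9)
    return all(len(adj[v] & Q) >= need for v in Q)


# Hypothetical examples: a 4-clique and the complete bipartite graph K_{3,3}.
clique4 = {v: {u for u in range(1, 5) if u != v} for v in range(1, 5)}
k33 = {v: ({4, 5, 6} if v <= 3 else {1, 2, 3}) for v in range(1, 7)}
```

In `k33` every vertex has degree 3, so the six vertices meet the constraints for a density threshold of 0.6 (required degree 3) but not for 0.8 (required degree 4).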

We define the structural correlation of an attribute set $S$
as the probability of a vertex $v$ with attribute set $S$ to be part
of a quasi-clique in $\mathcal{G}(S)$.

DEFINITION 2. (Structural correlation function $\epsilon$)
Given an attribute set $S$, the structural correlation of $S$,
$\epsilon(S)$, is given as:

$$\epsilon(S) = \frac{|\mathcal{K}_S|}{|\mathcal{V}(S)|} \qquad (1)$$

where $\mathcal{K}_S$ is the set of vertices in quasi-cliques in $\mathcal{G}(S)$.



In the graph from Figure 1, $\mathcal{K}_{\{A\}} = \{3, 4, 5, 6, 7, 8, 9, 10, 11\}$,
$\mathcal{K}_{\{C\}} = \emptyset$, and $\mathcal{K}_{\{A,B\}} = \{6, 7, 8, 9, 10, 11\}$, and thus the
corresponding values of $\epsilon(\{A\})$, $\epsilon(\{C\})$, and $\epsilon(\{A,B\})$ are
0.82, 0, and 1, respectively. Structural correlation measures
the dependence between attribute set $S$ and the density of
the associated vertices. It indicates how likely vertices with $S$ are to be
part of dense subgraphs. Our formulation enables the identification
of attributes that induce vertices that are well connected
in the graph. In a social network, for instance, such
attributes are of great interest since they may be related to
homophily or influence. Nevertheless, it is also relevant to
understand the dense subgraphs induced by attribute sets.
We call a quasi-clique that is homogeneous w.r.t. an attribute set
a structural correlation pattern.
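
For very small graphs, the structural correlation of Definition 2 can be computed by exhaustive search, as in the following sketch. It relies on the observation that a vertex belongs to some maximal quasi-clique exactly when it belongs to some vertex set satisfying the size and degree constraints. The toy graph is hypothetical (it is not Figure 1), and the brute-force search is an assumption of this illustration that is exponential in the number of vertices.

```python
import math
from itertools import combinations
from typing import Dict, Set


def covered_vertices(adj: Dict[int, Set[int]], vertices: Set[int],
                     gamma_min: float, min_size: int) -> Set[int]:
    """K_S: vertices in at least one set meeting the quasi-clique constraints."""
    ordered = sorted(vertices)
    covered: Set[int] = set()
    for k in range(min_size, len(ordered) + 1):
        need = math.ceil(gamma_min * (k - 1) - 1e-9)
        for cand in combinations(ordered, k):
            c = set(cand)
            if all(len(adj[v] & c) >= need for v in c):
                covered |= c
    return covered


def structural_correlation(adj, attrs, S, gamma_min, min_size) -> float:
    """epsilon(S) = |K_S| / |V(S)|, with 0 returned for an empty V(S)."""
    vs = {v for v, a in attrs.items() if S <= a}
    if not vs:
        return 0.0
    return len(covered_vertices(adj, vs, gamma_min, min_size)) / len(vs)


# Hypothetical toy graph: a triangle {1, 2, 3}, a pendant vertex 4, isolated 5.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}, 5: set()}
attrs = {1: {"A", "B"}, 2: {"A", "B"}, 3: {"A", "B"}, 4: {"A", "C"}, 5: {"A", "C"}}
```

With a density threshold of 1 and a minimum size of 3, only the triangle is dense, so attribute A (five vertices) has a structural correlation of 3/5, attribute B has 1, and attribute C has 0.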

DEFINITION 3. (Structural correlation pattern). A
structural correlation pattern is a pair $(S, Q)$, where $S$ is an
attribute set ($S \subseteq \mathcal{A}$), and $Q$ is a quasi-clique from the graph
induced by $S$ ($Q \subseteq \mathcal{V}(S)$), given the quasi-clique parameters
$\gamma_{min}$ and $min\_size$.

The pair ({A}, {3, 4, 5, 6}) is an example of a size 4 struc- 
tural correlation pattern with density 1 induced by the at- 
tribute A in the graph from Figure 1. Another example of a 
structural correlation pattern is ({A, B}, {6, 7, 8, 9, 10, 11}), 
which is a size 6 structural correlation pattern with density 
0.6 induced by the attribute set {A, B}. 

2.1.2 Structural Correlation Pattern Mining Problem

Based on the definition of structural correlation patterns 
and structural correlation function, we formulate the struc- 
tural correlation pattern mining problem. It comprises the 
identification of the attribute sets correlated with dense sub- 
graphs and the dense subgraphs induced by such attribute 
sets. We apply a minimum support threshold $\sigma_{min}$ for attribute
sets in order to prune down the number of patterns.

DEFINITION 4. (Structural correlation pattern mining
problem). Given an attributed graph $\mathcal{G}(\mathcal{V}, \mathcal{E}, \mathcal{A}, \mathcal{F})$, a
minimum support threshold $\sigma_{min}$, a minimum quasi-clique
density $\gamma_{min}$ and size $min\_size$, and a minimum structural
correlation $\epsilon_{min}$, the structural correlation pattern mining
consists of identifying the set of structural correlation patterns
$(S, Q)$ from $\mathcal{G}$, such that $S$ is an attribute set for which
$\sigma(S) \geq \sigma_{min}$, $\epsilon(S) \geq \epsilon_{min}$, and $Q$ is a $\gamma_{min}$-quasi-clique for
which $Q \subseteq \mathcal{V}(S)$ and $|Q| \geq min\_size$.

As an example, we consider the attributed graph shown in
Figure 1 and the parameters $\sigma_{min}$, $\gamma_{min}$, $min\_size$, and $\epsilon_{min}$
set to 3, 0.6, 4, and 0.5, respectively. The set of structural
correlation patterns is shown in Table 1. For each pattern,
we give the pair (attribute set, dense subgraph), the respective
quasi-clique size and density ($\gamma$), and the attribute set
support ($\sigma$) and structural correlation ($\epsilon$).



pattern                       size   $\gamma$   $\sigma$   $\epsilon$
({A},{6,7,8,9,10,11})         6      0.60       11         0.82
({A},{3,4,5,6})               4      1          11         0.82
({A},{3,4,6,7})               4      0.67       11         0.82
({A},{3,5,6,7})               4      0.67       11         0.82
({A},{3,6,7,8})               4      0.67       11         0.82
({B},{6,7,8,9,10,11})         6      0.60       6          1.0
({A,B},{6,7,8,9,10,11})       6      0.60       6          1.0

Table 1: Patterns from the graph shown in Figure 1



Similar to quasi-clique mining, structural correlation
pattern mining is #P-hard [17]. This is because the
quasi-clique mining problem can be reduced to structural
correlation pattern mining by assigning the same attribute
to each vertex from the graph and setting $\sigma_{min}$ to 1.

Structural correlation pattern mining is based on the struc- 
tural correlation function, which measures how a given at- 
tribute set is associated with the occurrence of dense sub- 
graphs in an attributed graph. However, it is important to 
assess the significance/interestingness of a given structural 
correlation, which is the subject of the next section. 



2.1.3 Statistical Significance of the Structural Correlation

Given the structural correlation of an attribute set, how 
can we evaluate it? In other words, what can be considered 
a high or low structural correlation? In this section, we ad- 
dress such questions by proposing null models for structural 
correlation. These models specify the expected structural 
correlation of an attribute set assuming that the correla- 
tion between vertex attributes and dense subgraphs is ran- 
dom. Normalized structural correlation measures how the 
structural correlation of an attribute set deviates from its 
expected value, and allows us to assess the statistical signif- 
icance of a given structural correlation value. 

DEFINITION 5. (Normalized structural correlation).
Given an attribute set $S$ with support $\sigma(S)$ and a function
$\epsilon_{exp}$, which gives the expected structural correlation of
an attribute set based on its support and the attributed graph
$\mathcal{G}$, the normalized structural correlation of $S$ is given by:

$$\delta(S) = \frac{\epsilon(S)}{\epsilon_{exp}(\sigma(S), \mathcal{G})} \qquad (2)$$

According to Definition 5, the normalized structural correlation
function gives how much higher than expected the structural
correlation of an attribute set $S$ is. Therefore, it
requires the definition of the function $\epsilon_{exp}$, which receives
the support of $S$ ($\sigma(S)$) and the attributed graph $\mathcal{G}$ as arguments.
By normalizing the structural correlation, we expect
to obtain a measure of the correlation of an attribute set $S$
that is independent of its support and the input graph.

We assume that the input graph $\mathcal{G}$ comprises the object of
interest, i.e., it is the "population" graph. Assume that we
are given the attribute set support value $\sigma(S)$ (independent
of the actual attribute set $S$). To compute the expected
structural correlation, our sample space is the set of all vertex
subsets of size $\sigma(S)$ drawn randomly from $\mathcal{G}$. The statistic
of interest is the mean structural correlation value, $\epsilon_{exp}$,
that is, the expected probability that a random vertex in
a given sample of size $\sigma(S)$ belongs to a dense subgraph
(quasi-clique) in that sample. The quasi-clique parameters, $\gamma_{min}$
and $min\_size$, are assumed to be fixed as well.

An intuitive approach for computing $\epsilon_{exp}$ is through simulation.
Given the support $\sigma(S)$ of the attribute set, a random
sample of $\sigma(S)$ vertices from $\mathcal{G}$ is selected. Each vertex
from the sample is checked for membership in a quasi-clique, according
to the quasi-clique parameters. The structural correlation
of the sample is the fraction of its vertices that are in at
least one quasi-clique. The simulation-based expected structural
correlation sim-$\epsilon_{exp}$ is given by the average structural
correlation of $r$ random samples.
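
The simulation procedure can be sketched as follows. This is only an illustration under simplifying assumptions: quasi-clique membership is decided by exhaustive search (feasible only for tiny samples), and the function names, the seeded random generator, and the toy graphs are hypothetical.

```python
import math
import random
from itertools import combinations
from typing import Dict, Set


def covered_vertices(adj: Dict[int, Set[int]], vertices: Set[int],
                     gamma_min: float, min_size: int) -> Set[int]:
    """Vertices in at least one set meeting the quasi-clique constraints."""
    ordered = sorted(vertices)
    covered: Set[int] = set()
    for k in range(min_size, len(ordered) + 1):
        need = math.ceil(gamma_min * (k - 1) - 1e-9)
        for cand in combinations(ordered, k):
            c = set(cand)
            if all(len(adj[v] & c) >= need for v in c):
                covered |= c
    return covered


def sim_expected_correlation(adj: Dict[int, Set[int]], support: int,
                             gamma_min: float, min_size: int,
                             r: int, seed: int = 0) -> float:
    """Average structural correlation over r random vertex samples."""
    rng = random.Random(seed)
    population = sorted(adj)
    total = 0.0
    for _ in range(r):
        sample = set(rng.sample(population, support))
        total += len(covered_vertices(adj, sample, gamma_min, min_size)) / support
    return total / r


# Two extreme hypothetical graphs on 5 vertices: complete and edgeless.
complete5 = {v: {u for u in range(5) if u != v} for v in range(5)}
edgeless5 = {v: set() for v in range(5)}
```

In the complete graph every sample of three vertices is a clique, so the estimate is 1 for any number of samples, while in the edgeless graph it is 0. Real graphs fall in between, and the accuracy of the estimate then depends on $r$.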

The simulation-based structural correlation is very simple
conceptually but may require a high $r$ to achieve accurate
estimates, which is prohibitive in real settings. Thus we
also propose an analytical formulation for an upper bound
on the expected structural correlation of an attribute set.
The idea is that a vertex must have a minimum degree of
$\lceil \gamma_{min} \cdot (min\_size - 1) \rceil$ in order to be a member of a
$\gamma_{min}$-quasi-clique of minimum size $min\_size$. Consequently, the
probability of a vertex to have a degree of at least
$\lceil \gamma_{min} \cdot (min\_size - 1) \rceil$ in a random subgraph of size $\sigma(S)$
from $\mathcal{G}$ gives an upper bound on the expected structural
correlation of $S$.

Given a random subgraph $\mathcal{G}_{\sigma(S)}$ of size $\sigma(S)$ from $\mathcal{G}$, the
degrees of $v$ in $\mathcal{G}$ and $\mathcal{G}_{\sigma(S)}$ are related as follows.

THEOREM 1. (Probability of a vertex that has a
degree $\alpha$ in $\mathcal{G}$ to have a degree $\beta$ in $\mathcal{G}_{\sigma(S)}$). If a random
vertex $v$ from $\mathcal{G}$ with degree $\alpha$ is selected to be part of $\mathcal{G}_{\sigma(S)}$,
the probability of such vertex to have a degree $\beta$ in $\mathcal{G}_{\sigma(S)}$ is
given by the following binomial function:

$$F(\alpha, \beta, \rho) = \binom{\alpha}{\beta} \rho^{\beta} (1 - \rho)^{\alpha - \beta} \qquad (3)$$

where $\rho$ is the probability of a specific vertex $u$ from $\mathcal{G}$ to be
in $\mathcal{G}_{\sigma(S)}$, if $v$ is already chosen, which is given as:

$$\rho = \frac{\sigma(S) - 1}{|\mathcal{V}| - 1} \qquad (4)$$

Proof sketch. There are $\alpha$ vertices adjacent to $v$ in $\mathcal{G}$;
thus, the probability of $v$ to have a degree of $\beta$ in $\mathcal{G}_{\sigma(S)}$ is
the probability of selecting $\beta$ out of $\alpha$ vertices to be part of
$\mathcal{G}_{\sigma(S)}$. Since $v$ is already selected, the probability of selecting
any remaining vertex from $\mathcal{G}$ is given by Equation 4.

Based on Theorem 1, we define an upper bound on the
expected structural correlation as the probability of a vertex
to have a degree of at least $\lceil \gamma_{min} \cdot (min\_size - 1) \rceil$ in $\mathcal{G}_{\sigma(S)}$.

THEOREM 2. (Upper bound on the expected structural
correlation). Given the quasi-clique parameters $\gamma_{min}$
and $min\_size$, the expected structural correlation of an attribute set
with support $\sigma(S)$ is upper bounded by:

$$\text{max-}\epsilon_{exp}(\sigma(S)) = \sum_{\alpha = z}^{m} p(\alpha) \cdot \sum_{\beta = z}^{\alpha} F(\alpha, \beta, \rho) \qquad (5)$$

where $z = \lceil \gamma_{min} \cdot (min\_size - 1) \rceil$, $m$ is the maximum degree
of a vertex from $\mathcal{G}$, and $p$ is the degree distribution of $\mathcal{G}$.
Proof sketch. Given a vertex with degree $\alpha$ in $\mathcal{G}$, the probability
of such vertex to have a degree of at least
$\lceil \gamma_{min} \cdot (min\_size - 1) \rceil$ in $\mathcal{G}_{\sigma(S)}$ is the sum of expression 3
over the degree interval from $\lceil \gamma_{min} \cdot (min\_size - 1) \rceil$ to $\alpha$. If
we multiply this sum by the probability of a vertex of degree
$\alpha$ from $\mathcal{G}$ to be in $\mathcal{G}_{\sigma(S)}$, i.e., $p(\alpha)$, it gives the probability
of any vertex with degree $\alpha$ from $\mathcal{G}$ to have a degree of
at least $\lceil \gamma_{min} \cdot (min\_size - 1) \rceil$ in $\mathcal{G}_{\sigma(S)}$. Equation 5 is the
sum of such products over the vertex degrees of at least
$\lceil \gamma_{min} \cdot (min\_size - 1) \rceil$.
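
The bound of Theorem 2 can be evaluated numerically from the degree sequence of the graph, as in the sketch below. It assumes that Equation 4 reads as the fraction of the remaining sample slots over the remaining vertices, i.e., (support - 1) / (|V| - 1), and that both sums start at the minimum required degree. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def binomial_f(alpha: int, beta: int, rho: float) -> float:
    """F(alpha, beta, rho): probability of keeping beta of alpha neighbours."""
    return math.comb(alpha, beta) * rho ** beta * (1.0 - rho) ** (alpha - beta)


def max_expected_correlation(degrees: Sequence[int], support: int,
                             gamma_min: float, min_size: int) -> float:
    """Upper bound on the expected structural correlation (sketch of Eq. 5)."""
    n = len(degrees)                      # |V|, assumed to be at least 2
    rho = (support - 1) / (n - 1)         # assumed reading of Equation 4
    z = math.ceil(gamma_min * (min_size - 1) - 1e-9)
    total = 0.0
    for alpha, count in Counter(degrees).items():
        if alpha < z:
            continue                      # such vertices can never reach degree z
        tail = sum(binomial_f(alpha, beta, rho) for beta in range(z, alpha + 1))
        total += (count / n) * tail       # p(alpha) times the binomial tail
    return total


# Hypothetical degree sequence: five vertices of degree 4 (a 5-clique).
degrees = [4, 4, 4, 4, 4]
bound = max_expected_correlation(degrees, support=3, gamma_min=1.0, min_size=3)
```

For this degree sequence and a support of 3, the selection probability is 0.5 and the required degree is 2, so `bound` equals the probability that a Binomial(4, 0.5) variable is at least 2, which is 11/16. Dividing an observed structural correlation by such a bound gives the lower-bound normalization discussed in the text.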

The proposed upper bound on the expected structural
correlation of an attribute set $S$ is based on the expected degree
distribution of a random subgraph of size $\sigma(S)$ from $\mathcal{G}$.
However, the degree is not the only criterion for a vertex to
be part of a quasi-clique. Vertices that satisfy the minimum
degree threshold may not be part of a quasi-clique if they
are connected to low degree vertices. Nevertheless, since we
apply the proposed formulation in order to normalize the
structural correlation of attribute sets with different supports,
our objective is to provide a function whose
slope is similar to that of the expected structural correlation. In
Section 4.1, we compare the expected structural correlation
computed using simulation with the proposed upper bound.

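We call $\delta_{sim}$ and $\delta_{lb}$ the normalized structural correlation
functions that apply the expected structural correlation
based on simulation (sim-$\epsilon_{exp}$) and the theoretical upper
bound (max-$\epsilon_{exp}$), respectively. Since max-$\epsilon_{exp} \geq$ sim-$\epsilon_{exp}$,
we have $\epsilon(S) / \text{max-}\epsilon_{exp} \leq \epsilon(S) / \text{sim-}\epsilon_{exp}$; thus, $\delta_{lb}$ is a lower
bound on $\delta_{sim}$.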



It is important to notice that max-$\epsilon_{exp}$ is monotonically
non-decreasing, i.e., max-$\epsilon_{exp}(\sigma_1) \geq$ max-$\epsilon_{exp}(\sigma_2)$
whenever $\sigma_1 \geq \sigma_2$. This follows directly from the fact that the
analytical upper bound (Equation 5) is based on a cumulative
binomial function, which is known to be monotonically
non-decreasing w.r.t. the selection probability $\rho$, and $\rho$ grows
with the support (Equation 4). We also assume that sim-$\epsilon_{exp}$ is
monotonically non-decreasing for sufficiently high values of
$r$, since an increase in the size of the random graphs selected
from $\mathcal{G}$ is not expected to decrease the probability of finding
a vertex in a quasi-clique. Such properties are exploited
by our pruning techniques, which are presented later
in this paper (see Section 3.2.1).

We apply the normalized structural correlation in the identification
of statistically significant structural correlation values.
Therefore, we extend the structural correlation pattern
mining problem (Definition 4) by adding a minimum
normalized structural correlation threshold $\delta_{min}$. Such a
threshold may also be useful to improve the performance of
structural correlation pattern mining algorithms, as will be
discussed in Section 3.2. Since a user may be interested in
patterns that have high structural correlation ($\epsilon$) as well as
being statistically significant ($\delta$), we present results using
both regular and normalized structural correlation.

2.2 Related Work 

Finding communities [6, 3] and dense subgraphs [5, 10, 
8, 20] has been an active research topic. A community 
is usually defined as a set of vertices significantly more connected
among themselves than with vertices outside it [3].
On the other hand, dense subgraphs, such as cliques [18], 
are strongly based on internal cohesion and maximality. 

This work applies a dense subgraph definition called quasi- 
clique, which is a set of vertices where each vertex is con- 
nected at least to a fraction of the others. [14] introduces the 
problem of mining cross-graph quasi-cliques. They further 
studied the problem of mining frequent cross-graph quasi- 
cliques [8]. In [20] and [21] the authors study the problem 
of mining frequent coherent closed quasi-cliques. [10] stud- 
ies the problem of finding quasi-cliques from a single graph, 
proposing pruning techniques for quasi-clique mining. 

Graph clustering and dense subgraph discovery methods 
that consider vertex attributes as complementary informa- 
tion have attracted the interest of the research community 
in recent years [12, 4, 22, 13]. A general assumption of
these methods is that clusters based on both the topology of 
the graph and the attributes of vertices are more meaningful 
than those based only on the topology or the attributes. [4] 
proposes two efficient algorithms for the connected k-center 
problem, whose objective is to partition a graph considering
both the attributes and the topology. [22] proposes a
random walk-based distance metric in an augmented graph 
where vertices from the original graph are connected to new 
vertices that represent vertex attributes. In [12], the authors 
introduce the problem of mining cohesive patterns, which 
are dense connected subgraphs where vertices have homo- 
geneous attributes (or features). [13] considers the problem 
of computing maximal homogeneous cliques in attributed 
graphs. Different from these methods, structural correla- 
tion pattern mining does not assume that vertex attributes 
are complementary information. In fact, we are interested 
in finding attribute sets that explain the formation of dense 
subgraphs through correlation. 

Assessing how vertex attributes are related to the graph 
topology has led to the definition of new patterns. [15]
proposed the problem of finding itemset-sharing subgraphs,
which consists of extracting subgraphs with common itemsets.
It is important to notice that such a method does not consider
the density of subgraphs. [9] defines proximity pattern
mining, which evaluates how close vertex attributes are
in the graph. A proximity pattern is a set of labels that co- 
occur in neighborhoods. Therefore, proximity patterns are 
not necessarily dense subgraphs or cohesive, differently from 
structural correlation patterns. In [7], the authors propose a 
different definition for the structural correlation, which com- 
pares the closeness among vertices induced by a given single 
attribute against a subgraph where attributes are randomly 
distributed. Our work differs from [7] by combining multiple 
attributes and considering a particular topological property 
which is the organization into dense subgraphs. Moreover, 
besides the evaluation of structural correlation of attribute 
sets, we are interested in the discovery of relevant dense sub- 
graphs to be representatives of the structural correlation. 

In [16], we introduce the structural correlation pattern 
mining and present an algorithm for this problem called 
SCORP. In this paper, we study the problem of identifying 
statistically significant structural correlation patterns based 
on a normalization of the structural correlation. We also 
present the SCPM algorithm, which extends SCORP with 
new pruning and search strategies for structural correlation 
pattern mining. Different from SCORP, SCPM enumerates 
the top structural correlation patterns in terms of size and 
density efficiently, instead of the complete set of patterns. 

3. ALGORITHMS 
3.1 Naive Algorithm 

Since structural correlation pattern mining combines as- 
pects of the frequent itemset mining and the quasi-clique 
mining problems, we may combine a frequent itemset min- 
ing algorithm and a quasi-clique mining algorithm into a 
naive algorithm for structural correlation pattern mining. 

The naive algorithm solves the structural correlation pattern
mining problem (see Definition 4) by first enumerating
the set of frequent attribute sets $\mathcal{T}$ from $\mathcal{G}$ and then identifying
the set of quasi-cliques $\mathcal{Q}$ from the graph induced
by each frequent attribute set $S$ from $\mathcal{T}$. The structural
correlation of each frequent attribute set $S$ is computed by
checking whether each vertex $v \in \mathcal{V}(S)$ is part of a quasi-clique
in $\mathcal{Q}$. Frequent attribute sets can be identified using a
frequent itemset mining algorithm [1, 19]. In this work, we
apply the Eclat algorithm [19]. Moreover, any algorithm for
quasi-clique mining can be used by such a naive algorithm.
We apply the Quick algorithm [10].

The main drawback of the naive algorithm is that it enumerates
the complete set of frequent attribute sets from
$\mathcal{G}$ and the complete set of quasi-cliques from each induced
graph $\mathcal{G}(S)$, where $S$ is a frequent attribute set. Since the
frequent itemset mining and the quasi-clique mining problems
are known to be #P-hard, the naive algorithm is not
expected to be able to process large attributed graphs.
In order to make this possible, in the upcoming sections, we
describe several strategies for efficient structural correlation
pattern mining. We combine such strategies into a new algorithm,
which is described in Section 3.2. Later in this
paper, we compare the performance of the proposed algorithm
against this naive method.
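
The structure of the naive approach can be sketched as follows. This is not the Eclat and Quick combination used in this work: as labeled assumptions, frequent attribute sets are generated by a simple level-wise join and quasi-cliques are found by exhaustive search, which is only feasible for toy inputs. All names and the toy graph are hypothetical.

```python
import math
from itertools import combinations


def quasi_cliques(adj, vertices, gamma_min, min_size):
    """Maximal vertex sets meeting the degree and size constraints (brute force)."""
    ordered = sorted(vertices)
    found = []
    # Largest candidates first, so maximality reduces to a subset test.
    for k in range(len(ordered), min_size - 1, -1):
        need = math.ceil(gamma_min * (k - 1) - 1e-9)
        for cand in combinations(ordered, k):
            c = frozenset(cand)
            dense = all(len(adj[v] & c) >= need for v in c)
            if dense and not any(c < q for q in found):
                found.append(c)
    return found


def naive_scpm(adj, attrs, sigma_min, gamma_min, min_size, eps_min):
    """Return (attribute set, quasi-clique) pairs; assumes sigma_min >= 1."""
    patterns = []
    level = [frozenset([a]) for a in sorted(set().union(*attrs.values()))]
    while level:
        frequent = []
        for S in level:
            vs = {v for v, a in attrs.items() if S <= a}
            if len(vs) < sigma_min:
                continue                      # infrequent: not extended
            frequent.append(S)
            qcs = quasi_cliques(adj, vs, gamma_min, min_size)
            covered = set().union(*qcs) if qcs else set()
            if len(covered) / len(vs) >= eps_min:
                patterns.extend((S, q) for q in qcs)
        # Join frequent sets that differ in exactly one attribute.
        level = sorted({a | b for a, b in combinations(frequent, 2)
                        if len(a | b) == len(a) + 1}, key=sorted)
    return patterns


# Hypothetical toy graph: a triangle {1, 2, 3}, a pendant vertex 4, isolated 5.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}, 5: set()}
attrs = {1: {"A", "B"}, 2: {"A", "B"}, 3: {"A", "B"}, 4: {"A"}, 5: {"A"}}
result = naive_scpm(adj, attrs, sigma_min=3, gamma_min=1.0, min_size=3, eps_min=0.5)
```

Every frequent attribute set triggers a full quasi-clique enumeration in this sketch, which is the cost that the strategies of the following sections aim to avoid.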



3.2 SCPM Algorithm 

This section presents the SCPM (Structural Correlation 
Pattern Mining) algorithm, which applies several strategies 
in order to enable the structural correlation pattern min- 
ing in large attributed graphs. Unlike the naive algorithm, 
SCPM does not enumerate every frequent attribute set but 
prunes those attribute sets that cannot satisfy a minimum 
structural correlation threshold. Moreover, instead of identi- 
fying each quasi-clique from an induced graph, SCPM checks 
whether vertices are in quasi-cliques by verifying a reduced 
number of quasi-clique candidates. Finally, SCPM returns 
the set of the top-k most relevant structural correlation pat- 
terns from the attributed graph. 

3.2.1 Pruning Strategies for SCP Mining

This section presents pruning techniques for structural 
correlation pattern mining. The objective of these pruning 
techniques is to reduce the execution time of the structural 
correlation pattern mining algorithms without compromising
their correctness. Theorem 3 allows the pruning of vertices
during the level-wise enumeration of attribute sets.

THEOREM 3. (Vertex pruning for attribute sets).
Let $\mathcal{K}_S$ be the set of vertices in dense subgraphs in the graph
induced by an attribute set $S$. If $S_i \subseteq S_j$, then $\mathcal{K}_{S_j} \subseteq \mathcal{K}_{S_i}$.
Proof sketch. Let us suppose that there exists a vertex $v$ such
that $v \in \mathcal{K}_{S_j}$ and $v \notin \mathcal{K}_{S_i}$. Since $v \in \mathcal{K}_{S_j}$, there exists a
dense subgraph $V \subseteq \mathcal{V}(S_j)$ such that $v \in V$. Moreover, if
$v \notin \mathcal{K}_{S_i}$, there does not exist any dense subgraph $U \subseteq \mathcal{V}(S_i)$
such that $v \in U$. Nevertheless, if $S_i \subseteq S_j$, then $\mathcal{V}(S_j) \subseteq
\mathcal{V}(S_i)$, which implies that $V \subseteq \mathcal{V}(S_i)$ (contradiction).

Based on Theorem 3, we can prune vertices that are not in 
dense subgraphs in the graph induced by a given attribute 
set before extending it to generate larger attribute sets. At- 
tribute sets can also be pruned based on an upper bound on 
the structural correlation function, as stated by Theorem 4. 

THEOREM 4. (Attribute set pruning based on the
upper bound on the structural correlation). For two attribute
sets $S_i$ and $S_j$, if $S_i \subseteq S_j$ and $\sigma(S_j) \geq \sigma_{min}$, then
$\epsilon(S_j) \leq \epsilon(S_i) \cdot |\mathcal{V}(S_i)| / \sigma_{min}$.
Proof sketch. According to Theorem 3, $\epsilon(S_i) \cdot |\mathcal{V}(S_i)| \geq
\epsilon(S_j) \cdot |\mathcal{V}(S_j)|$, since every vertex covered by a dense subgraph
in $\mathcal{V}(S_j)$ is also covered by a dense subgraph in $\mathcal{V}(S_i)$.
Moreover, since $\sigma(S_j) \geq \sigma_{min}$, $\epsilon(S_j)$ is upper bounded by
$\epsilon(S_i) \cdot |\mathcal{V}(S_i)| / \sigma_{min}$ based on the definition of the structural
correlation function $\epsilon$ (see Definition 2).

Given an attribute set $S_i$, of size $i$, if $\epsilon(S_i) \cdot |\mathcal{V}(S_i)| / \sigma_{min} <
\epsilon_{min}$, then $S_i$ is not included in the set of attribute sets to
be combined for the generation of size $i + 1$ attribute sets.
Theorem 4 guarantees that there does not exist an attribute
set $S_j$, such that $S_i \subseteq S_j$ and $\epsilon(S_j) \geq \epsilon_{min}$. A similar
pruning rule can be formulated based on the normalized
structural correlation function definition.

THEOREM 5. (Attribute set pruning based on the
upper bound on the normalized structural correlation).
For two attribute sets $S_i$ and $S_j$, if $S_i \subseteq S_j$, $\epsilon_{exp}$ is
monotonically non-decreasing, and $\sigma(S_j) \geq \sigma_{min}$, then
$\delta(S_j) \leq \epsilon(S_i) \cdot |\mathcal{V}(S_i)| / (\epsilon_{exp}(\sigma_{min}) \cdot \sigma_{min})$.
Proof sketch. According to Theorem 4,
$\epsilon(S_j) \leq \epsilon(S_i) \cdot |\mathcal{V}(S_i)| / \sigma_{min}$. Since $\sigma(S_j) \geq \sigma_{min}$ and $\epsilon_{exp}$ is
[Figure 2: Set enumeration tree]

Algorithm 1 General Structural Correlation Algorithm

Require: $\mathcal{G}(S)$, $\gamma_{min}$, $min\_size$
Ensure: $\mathcal{Q}$
 1: $\mathcal{Q} \leftarrow \emptyset$
 2: $X \leftarrow \emptyset$
 3: $candExts(X) \leftarrow \mathcal{V}(S)$
 4: Apply vertex pruning in $candExts(X)$
 5: $qcCands \leftarrow \{(X, candExts(X))\}$
 6: while $qcCands \neq \emptyset$ do
 7:     $q \leftarrow qcCands.get()$
 8:     Apply candidate quasi-clique pruning in $q$
 9:     if $q.X \cup q.candExts(X)$ is a quasi-clique then
10:         $\mathcal{Q} \leftarrow \mathcal{Q} \cup \{q.X \cup q.candExts(X)\}$
11:     else
12:         if $q.X$ is a quasi-clique then
13:             $\mathcal{Q} \leftarrow \mathcal{Q} \cup \{q.X\}$
14:         end if
15:         insert extensions of $q$ into $qcCands$
16:     end if
17: end while



monotonically non-decreasing, then $\epsilon_{exp}(\sigma(S_j)) \geq \epsilon_{exp}(\sigma_{min})$.
Therefore, $\delta(S_j) \leq \epsilon(S_i) \cdot |\mathcal{V}(S_i)| / (\epsilon_{exp}(\sigma_{min}) \cdot \sigma_{min})$.

If $\epsilon(S_i) \cdot |\mathcal{V}(S_i)| / (\epsilon_{exp}(\sigma_{min}) \cdot \sigma_{min}) < \delta_{min}$, the attribute
set $S_i$, of size $i$, is not included in the set of attribute sets to
be combined for the generation of size $i + 1$ attribute sets.
Since $\delta_{lb}$ gives a lower bound on the normalized structural
correlation, the whole pruning potential of Theorem 5 may
not be explored. Nevertheless, the results show that the use of
$\delta_{lb}$ enables significant performance gains (see Section 4.2).

The pruning strategy stated by Theorem 3 reduces the 
number of vertices to be checked to be in quasi-cliques in 
the computation of structural correlation. Theorems 4 and 
5 enable the reduction of the attribute sets for which the 
structural correlation is computed to a set that is expected 
to be smaller than the set of frequent attribute sets. 
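
The pruning tests derived from Theorems 4 and 5 each amount to a single comparison per attribute set, as the following sketch shows. The function and parameter names are hypothetical, and the expected structural correlation at the minimum support is passed in as a precomputed positive number (an assumption of this illustration).

```python
def can_extend_by_eps(eps_i: float, n_i: int, sigma_min: int,
                      eps_min: float) -> bool:
    """Theorem 4 test: keep S_i only if some superset may reach eps_min.

    eps_i is the structural correlation of S_i and n_i = |V(S_i)|.
    """
    return eps_i * n_i / sigma_min >= eps_min


def can_extend_by_delta(eps_i: float, n_i: int, sigma_min: int,
                        delta_min: float, eps_exp_at_sigma_min: float) -> bool:
    """Theorem 5 test, using the expected correlation at support sigma_min."""
    bound = eps_i * n_i / (eps_exp_at_sigma_min * sigma_min)
    return bound >= delta_min


# Hypothetical numbers: 10 vertices, 2 of them covered (eps_i = 0.2).
keep_eps = can_extend_by_eps(0.2, 10, sigma_min=5, eps_min=0.5)
keep_delta = can_extend_by_delta(0.2, 10, sigma_min=5, delta_min=5.0,
                                 eps_exp_at_sigma_min=0.1)
```

In this example only two covered vertices are available, so no superset with support of at least 5 can have a structural correlation above 0.4, and the attribute set is not extended under a threshold of 0.5.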

3.2.2 Computing the Structural Correlation 

As discussed in Section 3.1, the naive algorithm computes 
the structural correlation of an attribute set $S$ through the
enumeration of the quasi-cliques from $\mathcal{G}(S)$. In this section,
we describe how the structural correlation can be computed 
by identifying a reduced number of quasi-clique candidates. 

Quasi-cliques can be enumerated based on a vertex set
X, initially set as $\emptyset$, and a set of candidate extensions of X,
candExts(X), initially set as V. Vertices are moved from
candExts(X) to X, one at a time, until the complete set
of quasi-clique candidates is generated. Figure 2 shows



a set enumeration tree that represents the search space of 
quasi-cliques considering a set of 4 vertices (1-4). In order 
to prune down such search space, quasi-clique mining algo- 
rithms apply several pruning techniques. We divide these 
techniques into two groups: 

1. Vertex pruning: Removal of vertices that cannot be 
part of any quasi-clique in Q according to the quasi- 
clique definition and the quasi-clique parameters. Ver- 
tex pruning is performed iteratively over the graph in 
order to minimize the search space of quasi-cliques. 

2. Candidate quasi-clique pruning: Removal of can- 
didate quasi-cliques (i.e., pairs (X, candExts(X))) 
from the search space of quasi-cliques. Such removal 
is based on the properties of the subgraph composed 
by vertices from X and candExts(X). 
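As an illustration of vertex pruning, the sketch below repeatedly removes vertices whose degree is too low for them to belong to any quasi-clique. It assumes the common degree-based quasi-clique definition, in which every vertex of a quasi-clique with $n$ vertices has at least $\lceil \gamma_{min} (n - 1) \rceil$ neighbours inside it, so that a degree below $\lceil \gamma_{min} (min\_size - 1) \rceil$ rules a vertex out. The function name and the adjacency representation (a dictionary mapping each vertex to a set of neighbours) are hypothetical choices for this example, not the authors' implementation.

```python
import math


def prune_vertices(adj, gamma_min, min_size):
    """Iteratively drop vertices that cannot be part of any quasi-clique.

    adj maps each vertex to the set of its neighbours (undirected graph).
    Returns the set of vertices that survive the pruning.
    """
    # Smallest degree a vertex needs inside a quasi-clique of size >= min_size.
    min_degree = math.ceil(gamma_min * (min_size - 1))
    alive = set(adj)
    changed = True
    while changed:  # removing one vertex may lower the degree of others
        changed = False
        for v in list(alive):
            if len(adj[v] & alive) < min_degree:
                alive.discard(v)
                changed = True
    return alive
```

The loop runs until no vertex is removed, which reflects the iterative nature of vertex pruning: each removal can push neighbouring vertices below the degree threshold.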

Algorithm 1 gives a general description of how quasi- 
cliques are identified in the computation of structural corre- 
lation. This algorithm is also used as the basis for the enu- 
meration of the top-k structural correlation patterns. The 
algorithm receives an induced graph $\mathcal{G}(S)$, and the minimum
density ($\gamma_{min}$) and size ($min\_size$) for quasi-cliques.
It gives as output a set of quasi-cliques $Q$ from $\mathcal{G}(S)$. Vertex and
quasi-clique candidate prunings are applied in lines 4 and 8, 
respectively. Candidate quasi-cliques are managed by the 
data structure qcCands, which will be discussed later. Each 
candidate pattern is checked to be a lookahead quasi-clique 
(i.e., q.X U q.candExts(X) is a quasi-clique) first, due to the 
fact that quasi-cliques are maximal. In case such a condi- 
tion does not hold, q.X is checked to be a quasi-clique and 
the extensions of q are inserted into qcCands (line 15). The 
algorithm finishes when qcCands becomes empty. The set 
$K_S$, which is composed of vertices covered by quasi-cliques
in $\mathcal{G}(S)$, can be obtained directly from $Q$.

Since the quasi-clique mining problem is known to be #P- 
hard, the identification of quasi-cliques may require process- 
ing a large number of quasi-clique candidates, which would 
constitute an important limitation to the computation of 
the structural correlation of large induced graphs. Neverthe- 
less, computing the structural correlation does not require 
the enumeration of the complete set of quasi-cliques. The 
necessary information is whether each vertex from the in- 
duced graph is covered by a quasi-clique or not. Therefore, 
candidate quasi-cliques composed of vertices already known 
to be covered by quasi-cliques can be pruned from the new 
quasi-clique candidates generated in line 15 of Algorithm 1. 
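The candidate loop of Algorithm 1, combined with the coverage-based shortcut described above, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the degree-based quasi-clique definition (each vertex has at least $\lceil \gamma_{min} (n - 1) \rceil$ neighbours inside a quasi-clique of $n$ vertices), it takes the structural correlation to be the fraction of vertices of the induced graph that are covered, and it applies only two simple candidate prunings (size and coverage). All function names are hypothetical.

```python
import math
from collections import deque


def is_quasi_clique(adj, vertices, gamma_min, min_size):
    """Assumed definition: each vertex has >= ceil(gamma_min * (n - 1)) neighbours inside."""
    vs = set(vertices)
    if len(vs) < min_size:
        return False
    need = math.ceil(gamma_min * (len(vs) - 1))
    return all(len(adj[v] & vs) >= need for v in vs)


def covered_vertices(adj, gamma_min, min_size, depth_first=True):
    """Return the vertices covered by at least one quasi-clique."""
    covered = set()
    # Each candidate is a pair (X, candExts(X)); start from the empty set.
    cands = deque([(frozenset(), tuple(sorted(adj)))])
    while cands:
        x, exts = cands.pop() if depth_first else cands.popleft()
        whole = x | frozenset(exts)
        # Prune candidates that are too small or that can only cover known vertices.
        if len(whole) < min_size or whole <= covered:
            continue
        if is_quasi_clique(adj, whole, gamma_min, min_size):  # lookahead check
            covered |= whole
            continue
        if is_quasi_clique(adj, x, gamma_min, min_size):
            covered |= x
        children = [(x | {v}, exts[i + 1:]) for i, v in enumerate(exts)]
        if depth_first:
            children.reverse()  # pop the smallest extension first
        cands.extend(children)
    return covered


def structural_correlation(adj, gamma_min, min_size):
    """Fraction of vertices of the induced graph covered by quasi-cliques."""
    if not adj:
        return 0.0
    return len(covered_vertices(adj, gamma_min, min_size)) / len(adj)
```

For a triangle {1, 2, 3} with a fourth vertex attached only to vertex 3, calling `structural_correlation` with `gamma_min=1.0` and `min_size=3` covers the three triangle vertices, so three of the four vertices count towards the structural correlation.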

Besides pruning candidate quasi-cliques that are already 
known to be covered by dense subgraphs, we also propose 
search strategies for computing the structural correlation. 
These search strategies determine the order in which can- 
didate quasi-cliques are enumerated. A breadth-first search 
(BFS) strategy for computing the structural correlation tra- 
verses the search space of quasi-cliques in a breadth-first 
order, starting from the root and visiting the smaller vertex 
sets before the larger ones. On the other hand, a depth- 
first search (DFS) strategy extends vertex sets as much as 
possible. The BFS strategy is expected to perform better 
in case covering vertices with smaller quasi-cliques is more 
efficient than with larger quasi-cliques. Considering a set 
of 4 vertices, for which the search space of quasi-cliques is 
shown in Figure 2, the BFS and the DFS strategy visit the 
quasi-clique candidates as follows: 
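Algorithm 2 SCPM Algorithm

Require: $\mathcal{G}$, $\sigma_{min}$, $\gamma_{min}$, $min\_size$, $\epsilon_{min}$, $\delta_{min}$, $k$
Ensure: $\mathcal{P}$
1: $\mathcal{P} \leftarrow \emptyset$
2: $\mathcal{T} \leftarrow \emptyset$
3: $\mathcal{I} \leftarrow$ frequent attributes from $\mathcal{G}$
4: for all $S \in \mathcal{I}$ do
5:   $\epsilon \leftarrow$ structural correlation of $S$
6:   if $\epsilon \geq \epsilon_{min}$ AND $\epsilon / \epsilon_{exp}(S) \geq \delta_{min}$ then
7:     $Q \leftarrow$ top-k patterns from $\mathcal{G}(S)$
8:     for all $q \in Q$ do
9:       $\mathcal{P} \leftarrow \mathcal{P} \cup (S, q)$
10:     end for
11:   end if
12:   if $\epsilon \cdot \sigma(S) \geq \epsilon_{min} \cdot \sigma_{min}$ AND $\epsilon \cdot \sigma(S) \geq \delta_{min} \cdot \epsilon_{exp}(\sigma_{min}) \cdot \sigma_{min}$ then
13:     $\mathcal{T} \leftarrow \mathcal{T} \cup S$
14:   end if
15: end for
16: $\mathcal{P} \leftarrow \mathcal{P} \cup$ enumerate-patterns($\mathcal{T}$, $\mathcal{G}$, $\sigma_{min}$, $\gamma_{min}$, $min\_size$, $\epsilon_{min}$, $\delta_{min}$, $k$)

Algorithm 3 enumerate-patterns

Require: $\mathcal{T}$, $\mathcal{G}$, $\sigma_{min}$, $\gamma_{min}$, $min\_size$, $\epsilon_{min}$, $\delta_{min}$, $k$
Ensure: $\mathcal{P}$
1: $\mathcal{P} \leftarrow \emptyset$
2: for all $S_i \in \mathcal{T}$ do
3:   $\mathcal{R} \leftarrow \emptyset$
4:   for all $S_j \in \mathcal{T}$ do
5:     if $i > j$ then
6:       $S \leftarrow S_i \cup S_j$
7:       if $\sigma(S) \geq \sigma_{min}$ then
8:         $\epsilon \leftarrow$ structural correlation of $S$
9:         if $\epsilon \geq \epsilon_{min}$ AND $\epsilon / \epsilon_{exp}(S) \geq \delta_{min}$ then
10:           $Q \leftarrow$ top-k patterns from $\mathcal{G}(S)$
11:           for all $q \in Q$ do
12:             $\mathcal{P} \leftarrow \mathcal{P} \cup (S, q)$
13:           end for
14:         end if
15:         if $\epsilon \cdot \sigma(S) \geq \epsilon_{min} \cdot \sigma_{min}$ AND $\epsilon \cdot \sigma(S) \geq \delta_{min} \cdot \epsilon_{exp}(\sigma_{min}) \cdot \sigma_{min}$ then
16:           $\mathcal{R} \leftarrow \mathcal{R} \cup S$
17:         end if
18:       end if
19:     end if
20:   end for
21:   $\mathcal{P} \leftarrow \mathcal{P} \cup$ enumerate-patterns($\mathcal{R}$, $\mathcal{G}$, $\sigma_{min}$, $\gamma_{min}$, $min\_size$, $\epsilon_{min}$, $\delta_{min}$, $k$)
22: end for
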

BFS: {1}, {2}, {3}, {4}, {1, 2}, ... {1, 2, 3, 4}.
DFS: {1}, {1, 2}, {1, 2, 3}, {1, 2, 3, 4}, {1, 3}, ... {4}.



Quasi-cliques can be enumerated in BFS order by using a 
queue as a data structure to manage quasi-clique candidates 
in Algorithm 1. Similarly, a DFS strategy for enumerating 
quasi-cliques can apply a stack in order to manipulate candi- 
date patterns. Further in this paper, we evaluate the search 
strategies presented in this section. 
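The effect of the container on the traversal order can be illustrated with a short Python sketch that walks a set enumeration tree using `collections.deque` either as a queue (BFS) or as a stack (DFS). No pruning or quasi-clique test is applied here; the function name is hypothetical and the sketch only shows the visiting order.

```python
from collections import deque


def enumeration_order(vertices, depth_first):
    """List the non-empty vertex sets of a set enumeration tree in visiting order."""
    order = []
    cands = deque([((), tuple(vertices))])  # (X, candExts(X)), starting at the root
    while cands:
        # A stack (pop) yields DFS; a queue (popleft) yields BFS.
        x, exts = cands.pop() if depth_first else cands.popleft()
        if x:
            order.append(x)
        children = [(x + (v,), exts[i + 1:]) for i, v in enumerate(exts)]
        if depth_first:
            children.reverse()  # so that the smallest extension is popped first
        cands.extend(children)
    return order
```

With four vertices, the BFS order begins with the four singletons followed by (1, 2) and ends with (1, 2, 3, 4), while the DFS order begins with (1,), (1, 2), (1, 2, 3), (1, 2, 3, 4) and ends with (4,).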

3.2.3 Enumerating Top-k Patterns 

As discussed in Section 2.1.2, enumerating structural cor- 
relation patterns is a computationally expensive task. In 
this section, we study how to reduce the cost of enumerat- 
ing structural correlation patterns by restricting the output 
set to only the top-k most relevant patterns in terms of size
(primary criterion) and density (secondary criterion).

The enumeration of the top-k structural correlation pat- 
terns follows the same procedure described in Algorithm 1. 
We use a DFS strategy in the discovery of the top-k pat- 
terns because structural correlation patterns are maximal 
(see Definition 3). However, since the number of patterns to 



be discovered is known, a current set of patterns can be ap- 
plied to prune the search space of new candidates. New can- 
didate quasi-cliques are generated in line 15. In case the cur- 
rent set of top patterns contains k patterns and a candidate 
pattern p cannot produce a pattern larger than the smallest
current top-k pattern t (i.e., $|p.X \cup p.candExts(X)| < |t|$), p
can be pruned. By updating the set of top-k patterns, the
minimum size threshold is increased iteratively. As a conse- 
quence, the top-k patterns are enumerated more efficiently 
than the complete set of patterns from an induced graph. 
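One way to maintain the current top-k set and derive the size-based pruning test is a min-heap ordered by size and then density, as in the Python sketch below. The class `TopK` and its method names are hypothetical, and the density of each pattern is assumed to be computed elsewhere and passed in.

```python
import heapq


class TopK:
    """Keep the k best patterns, ranked by size (primary) and density (secondary)."""

    def __init__(self, k):
        self.k = k
        self._heap = []  # min-heap of (size, density, sorted vertex tuple)

    def can_prune(self, x, exts):
        """True if the candidate (X, candExts(X)) cannot beat the smallest top-k pattern."""
        if len(self._heap) < self.k:
            return False  # fewer than k patterns so far, nothing can be pruned
        smallest_size = self._heap[0][0]
        return len(set(x) | set(exts)) < smallest_size

    def offer(self, vertices, density):
        """Insert a newly found pattern, evicting the current worst one if needed."""
        entry = (len(vertices), density, tuple(sorted(vertices)))
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            heapq.heapreplace(self._heap, entry)

    def patterns(self):
        """Return the stored patterns, best first."""
        return sorted(self._heap, reverse=True)
```

As larger patterns are offered, the size of the smallest stored pattern grows, so `can_prune` rejects more candidates; this mirrors the iteratively increasing minimum size threshold described above.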

Algorithm 2 is a high-level description of the SCPM al- 
gorithm, which applies the strategies for efficient structural 
correlation pattern mining presented in this section. The 
initial set of attributes $\mathcal{I}$ is composed of those with a support
of at least $\sigma_{min}$ (line 3). The structural correlation of
each size-one attribute set $S \in \mathcal{I}$ is computed as described
in Section 3.2.2. If the structural correlation of $S$ satisfies
the minimum structural correlation ($\epsilon_{min}$) and normalized
structural correlation ($\delta_{min}$) thresholds, the top-k patterns
induced by $S$ are identified using the algorithm described
in this section (line 7). These patterns are included in a
set of patterns $\mathcal{P}$ that will be given as output. The pruning
rules for attribute sets based on $\epsilon$ and $\delta$ (see Section 3.2.1)
are applied in line 12. Pruned attributes are not included
in the set of attributes $\mathcal{T}$ to be extended. These attributes
are extended by the function enumerate-patterns (line 16).

Algorithm 3 describes the function enumerate-patterns. 
It receives the same input parameters as SCPM, and also
the set of attribute sets to be extended, $\mathcal{T}$. It returns the set
of top-k patterns $(S, V)$ that have attribute sets extended
from those in $\mathcal{T}$ regarding the input parameters. New attribute
sets are generated through the union of existing ones
(line 6). Attribute sets are traversed in a DFS order (e.g.,
{A}, {A, B}, {A, B, C} ... {E}). The enumerate-patterns
function is similar to Algorithm 2, except that each new
attribute set $S$ is checked to satisfy the minimum support
threshold $\sigma_{min}$ (line 7). All valid attribute sets are generated
through recursive calls to enumerate-patterns (line 21).
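The recursive, depth-first generation of attribute sets can be sketched in Python as follows. This simplified example keeps only the support check; the computation of the structural correlation, the top-k pattern search, and the pruning rules of Section 3.2.1 are omitted. It pairs each attribute set with the sets listed after it at the same level, which is one way (assumed here) to avoid generating the same union twice. The function names and the input format (a dictionary mapping each vertex to its set of attributes) are hypothetical.

```python
def support(vertex_attrs, attr_set):
    """Number of vertices whose attributes include every attribute in attr_set."""
    return sum(1 for attrs in vertex_attrs.values() if attr_set <= attrs)


def enumerate_attribute_sets(vertex_attrs, sigma_min):
    """Return the frequent attribute sets in the order a DFS traversal visits them."""
    items = sorted({a for attrs in vertex_attrs.values() for a in attrs})
    frequent = [frozenset([a]) for a in items
                if support(vertex_attrs, frozenset([a])) >= sigma_min]
    visited = []

    def extend(level):
        for i, s_i in enumerate(level):
            visited.append(s_i)
            next_level = []
            for s_j in level[i + 1:]:
                s = s_i | s_j  # union of two sets from the same level
                if support(vertex_attrs, s) >= sigma_min:
                    next_level.append(s)
            extend(next_level)  # recursive call on the extensions of s_i

    extend(frequent)
    return visited
```

With attributes A, B, and C that are all frequent in every combination, the sets are visited in the order {A}, {A, B}, {A, B, C}, {A, C}, {B}, {B, C}, {C}.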

4. EXPERIMENTAL RESULTS 

This section presents case studies on structural correlation
pattern mining using real datasets. Moreover, we
evaluate the performance and study the sensitivity of important
input parameters of SCPM. Experiments were executed
on a 16-core Intel Xeon 2.4 GHz with 50GB of RAM. The
implementations are available as open source¹.

4.1 Case Studies 

4.1.1 DBLP 

In the attributed graph extracted from the DBLP² digital
library, each vertex represents an author and two authors are
connected if they have co-authored a paper. The attributes
of authors are terms that appear in the titles of papers authored
by them³. In the DBLP dataset, an attribute set defines
a topic (i.e., a set of terms that carry a specific meaning
in the literature) and a dense subgraph is a community.

The DBLP dataset has 108,030 vertices, 276,658 edges 
and 23,285 attributes. Table 2 shows the top 10 attribute 

¹ http://code.google.com/p/scpm/

² http://www.informatik.uni-trier.de/~ley/db

³ Stemming and removal of stop words were applied.
(a) Graph induced by {search, rank}

(b) Pattern induced by {perform, system}

Figure 3: Examples of results from the DBLP dataset

Table 2: DBLP - Top support ($\sigma$), str. correlation ($\epsilon$), and normalized str. correlation ($\delta_{lb}$) attribute sets.

Figure 4: DBLP - Expected $\epsilon$ computed by the simulation
(sim-$\epsilon_{exp}$) and analytical (max-$\epsilon_{exp}$) models.



sets w.r.t. support ($\sigma$), structural correlation ($\epsilon$), and normalized
structural correlation ($\delta_{lb}$). The minimum size
($min\_size$) and density ($\gamma_{min}$) parameters were set to 10 and
0.5, respectively. The minimum support threshold ($\sigma_{min}$)
was set to 400 and we considered only attribute sets of size
at least 2. The parameters used in our case studies were
selected empirically.

Top-$\sigma$ attribute sets present a low correlation with the
formation of dense subgraphs in the DBLP dataset. Such
terms are popular in paper titles, but do not carry much
knowledge regarding the formation of research communities.
On the other hand, top-$\epsilon$ attribute sets may
be more easily associated with known topics in computer science.
The attribute set {grid, applic} has the highest structural
correlation (0.26), i.e., 26% of the authors that have
the keywords "grid" and "applic" are inside a community of
researchers of size at least 10 where each of them has collaborated
with at least half of the other members. It is interesting
to point out that the graph induced by {grid, applic} has 
more vertices in dense subgraphs than the graph induced by 
{base, system}, though {base, system} is more than 6 times 
more frequent than {grid, applic} . In general, high support 
attribute sets do not present high structural correlation. 

Figure 4 shows the expected structural correlation for
different support values in the DBLP dataset. The input
parameters are the same as those used to generate the results
shown in Table 2. For the simulation model, we executed
1000 simulations for each support value and also show
the standard deviation of the estimated expected structural
correlation. The analytical upper bound is not tight
w.r.t. the simulation results, but presents a similar growth,
which shows that it enables accurate comparisons between
the structural correlation of attribute sets.

Based on the proposed analytical model, the third column 
of Table 2 shows the top attribute sets in terms of analytical
normalized structural correlation ($\delta_{lb}$). The attribute set

(a) Graph induced by {S Stevens, Wilco} (b) Pattern induced by {Van Morrison}

Figure 5: Examples of results from the LastFm dataset

Table 3: LastFm - Top support ($\sigma$), str. correlation ($\epsilon$), and normalized str. correlation ($\delta_{lb}$) attribute sets.



{search, rank} has the highest normalized structural corre- 
lation (635,349), i.e., the structural correlation is 635,349 
times the upper bound on its expected structural correla- 
tion given by the analytical model. Figure 3(a) presents the 
graph induced by {search, rank}. Vertices contained in a 
dense subgraph are indicated. Dense subgraphs cover the 
densest components of the induced graph. In general, top-$\sigma$
attribute sets have low $\delta_{lb}$ when compared to the top-$\delta_{lb}$
attribute sets. Moreover, high values of $\epsilon$ do not necessarily
lead to high values of $\delta_{lb}$. Figure 3(b) shows the largest
structural correlation pattern in terms of number of vertices 
from DBLP, which represents two important interconnected 
research groups on high performance systems. 
Figure 7: LastFm - Expected $\epsilon$ computed by the simulation
(sim-$\epsilon_{exp}$) and analytical (max-$\epsilon_{exp}$) models.



4.1.2 LastFm 

LastFm⁴ is an online social music network. We use a sample
of the LastFm users crawled through an API provided
by LastFm. In the LastFm network, vertices represent users 
and edges represent friendships. The attributes of a vertex 
are the artists the respective user has listened to. An at- 
tribute set in the LastFm dataset represents, in a more gen- 
eral interpretation, a musical taste (i.e., set of artists) and 
a dense subgraph is a community. 

The LastFm dataset contains 272,412 vertices, 350,239 
edges, and 3,929,101 attributes. Table 3 shows the top 10 
attribute sets in terms of support ($\sigma$), structural correlation
($\epsilon$) and normalized structural correlation ($\delta_{lb}$) discovered
from LastFm. The minimum size ($min\_size$) and density



⁴ http://www.last.fm



($\gamma_{min}$) parameters were set to 5 and 0.5, respectively. The
minimum support threshold ($\sigma_{min}$) was set to 27,000.

In general, the top-$\epsilon$ attribute sets are the most frequent
ones. However, such attribute sets present low normalized 
structural correlation. In other words, although these at- 
tributes are frequent and have several vertices covered by 
communities, this coverage is not much higher than ex- 
pected. Considering the normalized structural correlation, 
which takes into account the expected structural correla- 
tion of an attribute set, the top patterns change signifi- 
cantly. Figure 7 shows the expected structural correlation 
for support values varying from 20,000 to 100,000. Each 
simulation-based expected structural correlation value corresponds
to an average of 100 simulations. The top-$\delta_{lb}$ attribute
set {S Stevens, Wilco} includes the American singer
and songwriter Sufjan Stevens and the American band Wilco.
(a) Graph induced by {node, wireless} (b) Pattern induced by {perform, system}

Figure 6: Examples of results from the CiteSeer dataset

Table 4: CiteSeer - Top support ($\sigma$), str. correlation ($\epsilon$), and normalized str. correlation ($\delta_{lb}$) attribute sets.



Figure 5(a) shows the graph induced by the attribute set 
{S Stevens,Wilco}. For clarity, we removed vertices with 
degree lower than 2. By visualizing vertices inside and out- 
side structural correlation patterns, we can understand how 
the structural correlation captures the relationship between 
attributes and dense subgraphs. The largest structural cor- 
relation pattern found is presented in Figure 5(b). It rep- 
resents a community of 34 users who have listened to the 
Northern Irish singer and songwriter Van Morrison. Vertex 
identifiers are not shown due to privacy issues. 

4.1.3 CiteSeer 

CiteSeerX⁵ is a scientific literature digital library and
search engine. We built a citation graph from CiteSeerX
as of March 2010. In the CiteSeer graph, papers are represented
by vertices and citations by undirected edges. Each
paper has as attributes terms extracted from its abstract⁶.
Attribute sets represent topics and dense subgraphs define 
groups of related work in the CiteSeer graph. 

The CiteSeer dataset has 294,104 vertices, 782,147 edges, 
and 206,430 attributes. The parameter setting applied in
this case study is $\sigma_{min} = 2000$, $min\_size = 5$, and $\gamma_{min} = 0.5$.
Table 4 shows the top attribute
⁵ http://citeseerx.ist.psu.edu

⁶ Stemming and stop word removal were applied.

Figure 9: CiteSeer - Expected $\epsilon$ for the simulation
(sim-$\epsilon_{exp}$) and analytical (max-$\epsilon_{exp}$) models.



sets w.r.t. $\sigma$, $\epsilon$, and $\delta_{lb}$ discovered. Top-$\sigma$ attribute sets
present low structural correlation and normalized structural
correlation when compared to the top-$\epsilon$ and top-$\delta_{lb}$ attribute
sets, respectively. Moreover, similar to the DBLP dataset,
while the top-$\sigma$ attribute sets from the CiteSeer dataset are
generic terms, the top-$\epsilon$ and top-$\delta_{lb}$ attribute sets may be
easily associated with known research topics (e.g., computer
networks, query optimization).

Figure 9 shows the expected structural correlation for dif- 
ferent support values in CiteSeer. The attribute set {node, 
wireless} has the highest normalized structural correlation 
($\delta_{lb} = 164.40$). Figure 6(a) shows the graph induced by
the attribute set {node, wireless} in CiteSeer. Figure 6(b)
presents the largest structural correlation pattern discovered 
in the CiteSeer dataset. Vertex labels are the initials of pa- 
per titles. The papers included in the pattern cover topics 
such as caching, memory management, computer networks, 
processor design, and instruction level optimization (e.g., 
Attribute Caches, Systems for Late Code Modification, Lim- 
its of Instruction Level Parallelism, Link-time Optimization 
of Address Calculation on a 64-bit Architecture). We do not 
show the full list of paper titles due to space limitations. 

4.2 Performance Evaluation 

This section evaluates the performance of the structural 
correlation pattern mining algorithms. The dataset used is 
a smaller version of the DBLP dataset (SmallDBLP), which 
has 32,908 vertices, 82,376 edges, and 11,192 attributes. 

SCPM-BFS and SCPM-DFS are versions of the
SCPM algorithm using the BFS and DFS strategies, respectively.
The Naive algorithm enumerates the complete set of
quasi-cliques from the induced graphs, as described in Section 3.1.
We vary each parameter of the algorithms keeping
the others constant. Default values for $\gamma_{min}$, $min\_size$, and
$\sigma_{min}$ are 0.5, 11, and 100. Moreover, $\epsilon_{min}$, $\delta_{min}$, and $k$ are
set to 0.1, 1, and 5, respectively, unless stated otherwise.

Figures 8(a), 8(b), and 8(c) show the runtime of the algorithms
for varying values of $\gamma_{min}$, $min\_size$, and $\sigma_{min}$,
respectively. In general, SCPM-DFS achieves the best results,
being up to 3 orders of magnitude faster than the
Naive algorithm. Moreover, SCPM-BFS performs better 
than the Naive algorithm in all the experiments. 

In terms of the $\epsilon_{min}$ (Figure 8(d)) and $\delta_{min}$ (Figure 8(e))
parameters, both SCPM-BFS and SCPM-DFS apply
the pruning techniques described in Section 3.2.1. Based on
the results shown in Figures 8(d) and 8(e), we can notice
that such techniques lead to significant performance gains
when the values of $\epsilon_{min}$ and $\delta_{min}$ are increased.

In Figure 8(f), we show the runtime of SCPM-DFS and 
the Naive algorithm for different values of k. The results of 
SCPM-BFS are omitted because both SCPM-BFS and 
SCPM-DFS algorithms apply the same strategy for iden- 
tifying the top-k structural correlation patterns (see Section 
3.2.3). The inset also shows the execution time of SCPM-DFS
using a linear scale for the y-axis, to show more clearly
the effect of $k$ on the runtime. The results show that for low
values of $k$, SCPM-DFS is able to achieve low running
times, outperforming the Naive algorithm significantly.

4.3 Parameter Sensitivity and Setting 

We now assess how different input parameters affect the 
output of structural correlation pattern mining. Our objec- 
tive is to provide guidelines for setting the parameters of 
SCPM. Figure 10 shows the average structural correlation 
and normalized structural correlation of the complete output 
(global) and the top-10% attribute sets from the SmallDBLP 



dataset varying the $\gamma_{min}$, $min\_size$, and $\sigma_{min}$ parameters.
Default values for $\gamma_{min}$, $min\_size$, and $\sigma_{min}$ are 0.5, 10, and
100. The results show that more restrictive quasi-clique parameters
(i.e., high values of $\gamma_{min}$ and $min\_size$) reduce the
average $\epsilon$ but may increase $\delta$, since dense subgraphs become
less expected. Moreover, high values of $\sigma_{min}$ are related to
high values of structural correlation $\epsilon$. However, such attribute
sets also present high values of $\epsilon_{exp}$, leading to low
values of normalized structural correlation $\delta$.

SCPM is an exploratory pattern mining method, and thus 
reasonable values for the different parameters can be obtained
by searching the parameter space. The minimum
density parameter, $\gamma_{min}$, and the minimum quasi-clique size,
$min\_size$, will depend on the application. For $\sigma_{min}$, a useful
guideline is to select values that produce a significant
expected structural correlation. Infrequent attribute sets
may not be expected to induce any dense subgraph. The
other parameters ($\epsilon_{min}$, $\delta_{min}$, and $k$) are intended to
speed up the algorithm and must be set according to the
available computational resources and time.



5. CONCLUSIONS 

In this paper, we studied the problem of correlating vertex 
attributes and dense subgraphs in large attributed graphs. 
The concept of structural correlation, which measures how
an attribute set induces dense subgraphs in an attributed
graph, was proposed. We also presented normalization approaches
that compare the structural correlation of a given
attribute set against its expected value, which provides a
measure of the statistical significance for the structural correlation.
In order to enable the analysis of large databases,
we introduced search and pruning strategies for structural 
correlation pattern mining. We also proposed an algorithm 
for the identification of the top structural correlation pat- 
terns, which are the largest and densest subgraphs induced 
by a given set of attributes. The patterns and algorithms 
proposed were applied to three real datasets. The attribute 
sets and patterns found represent relevant knowledge in terms 
of the correlation between attributes and dense subgraphs. 